Batch Reinforcement Learning with State Importance

Authors

  • Lihong Li
  • Vadim Bulitko
  • Russell Greiner
Abstract

We investigate the problem of using function approximation in reinforcement learning where the agent's policy is represented as a classifier mapping states to actions. High classification accuracy is usually deemed to correlate with high policy quality. But this is not necessarily the case, as increasing classification accuracy can actually decrease the policy's quality. This phenomenon takes place when the learning process begins to focus on classifying less "important" states. In this paper, we introduce a measure of a state's decision-making importance that can be used to improve policy learning. As a result, the focused learning process is shown to converge faster to better policies.

1 Problem Formulation and Related Work

Reinforcement learning (RL) [11] provides a general framework for many sequential decision-making problems and has succeeded in a number of important applications. Let S be the state space, A the action set, and D the start-state distribution. A policy is a mapping from states to actions: π : S → A. (Due to space limitations, only deterministic policies with binary actions are discussed; details and extensions can be found in [9].) The state- and action-value functions are denoted by V(s) and Q(s, a), respectively [11]. The quality of a policy π is measured by its policy value [10]: V(π) = E_{s0∼D}[V(s0)]. An RL agent attempts to learn the optimal policy with maximal value: π* = argmax_π V(π). The corresponding optimal state- and action-value functions are denoted by V*(s) and Q*(s, a), respectively.

In this paper, we focus on classification-based RL methods where a policy π is represented as a classifier labeling state s with action π(s). Learning a policy π is then reduced to learning a classifier [4, 6, 7, 12]. Recent implementations of this idea have demonstrated promising performance in several domains by learning high-quality policies through high-accuracy classification. It should be noted, however, that in sequential decision-making the classification error is not the target performance measure of a reward-collecting agent. Consequently, increasing classification accuracy can actually lower the policy value [9]. An intuitive explanation is that not all states are equally important in terms of preferring one action to another. Therefore, classification-based RL methods can be improved by focusing the learning process on more important states. The expected benefits include faster convergence to better policies.

We examine so-called batch reinforcement learning, in which policy learning occurs offline. Such a framework is important wherever online learning is not feasible (e.g., when the reward data are limited), and therefore a fixed set of experiences has to be acquired and used for offline policy learning [2, 8]. In particular, we are interested in a special case where the state space is sparsely sampled and the optimal action values for these sampled states are computed or at least estimated. The sampled states together with their optimal action values form the training data for batch learning: T_Q* = {⟨s, a, Q*(s, a)⟩ | s ∈ T ⊂ S, a ∈ A}, where T is the sparsely sampled state space.
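As a concrete illustration of this construction, the short Python sketch below assembles T_Q* from a sparse sample of states. It assumes access to a routine q_star_estimate(s, a) (a hypothetical name introduced here, not from the paper) that returns a computed or estimated optimal action value for each sampled state-action pair.

```python
from typing import Callable, Hashable, Iterable, List, Tuple

State = Hashable
Action = int  # binary action set, e.g. A = (0, 1), as assumed in the paper


def build_q_training_set(
    sampled_states: Iterable[State],
    actions: Tuple[Action, ...],
    q_star_estimate: Callable[[State, Action], float],
) -> List[Tuple[State, Action, float]]:
    """Form T_Q* = {<s, a, Q*(s, a)> | s in T, a in A} from a sparse sample T
    of the state space and an estimator of the optimal action values
    (computed offline, as discussed below)."""
    return [(s, a, q_star_estimate(s, a)) for s in sampled_states for a in actions]
```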
The assumption of knowing the optimal action values may at first seem unrealistic. However, a technique called full-trajectory-tree expansion [5, 8] can be used to compute or estimate such values. This technique is especially useful in domains where good policies generalize well across problems of different sizes: the agent can first obtain a good policy on problems with a tractable state space, where the technique is applicable, and then generalize the policy to larger problems.

With the training data T_Q*, the optimal actions can be computed: for all s ∈ T, a*(s) = argmax_a Q*(s, a), and the training data for learning a classifier-based policy are formed: T_CI = {⟨s, a*(s)⟩ | s ∈ T}. Finally, the optimal policy is approximated by minimizing the classification error: π̂*_CI = argmin_{π̂*} Σ_{s∈S} I(π̂*(s) ≠ π*(s)), where I(A) = 1 if A is true and 0 otherwise. The subscript CI (cost-insensitive) is in contrast to its cost-sensitive counterpart that will be introduced in the next section.

2 Batch Reinforcement Learning with State Importance

In contrast to the cost-insensitive algorithm outlined in the previous section, a novel RL algorithm based on cost-sensitive classification is proposed, which uses the state importance values as misclassification costs. As a result, the learning process focuses on important states, thereby improving the convergence speed as well as the policy value. Intuitively, a state is important from the decision-making point of view if making a wrong decision in it can have significant repercussions. Therefore, the importance of a state s, G*(s), is defined as: G*(s) = Q*(s, a*(s)) − Q*(s, ā(s)), where a*(s) is the optimal action and ā(s) is the other (sub-optimal) action. (Such a definition of G*(s) is similar to the advantage introduced by Baird [1].) Similarly, the importance of state s under policy π, G*(s, π), is defined as: G*(s, π) = Q*(s, a*(s)) − Q*(s, π(s)). Clearly, if π(s) = a*(s), then G*(s, π) = 0; otherwise, G*(s, π) = G*(s).

It is desirable for the agent to approximate π* by agreeing with it at important states. One way is to use the state importance values as the misclassification costs: π̂*_CS = argmin_{π̂*} Σ_{s∈S} G*(s) · I(π̂*(s) ≠ π*(s)). Then learning the policy is reduced to cost-sensitive classification where s is the attribute, a*(s) is the desired class label, and G*(s) is the misclassification cost. Thus, given the training data T_Q*, the agent can first compute G*(s) for all states s ∈ T to form a training set T_CS = {⟨s, a*(s), G*(s)⟩ | s ∈ T}, and then compute π̂*_CS using cost-sensitive classification techniques.

A question of both theoretical and practical interest is whether it is preferable to solve for π̂*_CS as opposed to π̂*_CI. It is shown in [9] that: (i) the policy value is lower-bounded in terms of the cost-sensitive classification error of π̂*_CS; however, (ii) if the cost-insensitive classification error of π̂*_CI is not zero, then no matter how small the error is, the resulting policy can be arbitrarily close to the worst policy in terms of policy value. Empirical support was gained from experiments on a series of 2D grid-world domains.
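To make the quantities above concrete, the following Python sketch (illustrative names, not the authors' implementation) derives a*(s) and G*(s) from T_Q*, builds the cost-insensitive and cost-sensitive training sets T_CI and T_CS, and evaluates the two classification losses for a candidate policy; it assumes the binary-action setting discussed in the paper.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Tuple

State = Hashable
Action = int


def build_policy_training_sets(
    t_q_star: List[Tuple[State, Action, float]],
) -> Tuple[List[Tuple[State, Action]], List[Tuple[State, Action, float]]]:
    """From T_Q* = {<s, a, Q*(s, a)>}, compute for each sampled state s
    the optimal action a*(s) = argmax_a Q*(s, a) and the state importance
    G*(s) = Q*(s, a*(s)) - Q*(s, abar(s)) (binary actions), and return
    T_CI = [(s, a*(s))] and T_CS = [(s, a*(s), G*(s))]."""
    q_by_state: Dict[State, Dict[Action, float]] = defaultdict(dict)
    for s, a, q in t_q_star:
        q_by_state[s][a] = q

    t_ci: List[Tuple[State, Action]] = []
    t_cs: List[Tuple[State, Action, float]] = []
    for s, q_values in q_by_state.items():
        a_star = max(q_values, key=q_values.get)
        a_bar = next(a for a in q_values if a != a_star)  # the other, sub-optimal action
        t_ci.append((s, a_star))
        t_cs.append((s, a_star, q_values[a_star] - q_values[a_bar]))  # G*(s) >= 0
    return t_ci, t_cs


def cost_insensitive_error(policy: Callable[[State], Action],
                           t_ci: List[Tuple[State, Action]]) -> float:
    """Counts disagreements with a*(s): sum over s of I(pi(s) != a*(s))."""
    return sum(1.0 for s, a_star in t_ci if policy(s) != a_star)


def cost_sensitive_error(policy: Callable[[State], Action],
                         t_cs: List[Tuple[State, Action, float]]) -> float:
    """Importance-weighted disagreements: sum over s of G*(s) * I(pi(s) != a*(s))."""
    return sum(g for s, a_star, g in t_cs if policy(s) != a_star)
```

A cost-sensitive learner would then be trained to minimize the latter quantity over its hypothesis class of classifiers, which is the focused-learning objective described above.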
3 Summary and Future Work

Classification-based policy acquisition is an interesting development in RL that attempts to obtain a better policy by increasing classification accuracy. However, the correlation between policy value and classification accuracy is non-monotonic, as the states are not equally important. We therefore proposed a measure of a state's decision-making importance and outlined a way to utilize such values in a class of RL problems. The advantages of such a method are supported both theoretically and empirically.

The promising initial results open several avenues for future research. First, when computing resources are limited, it is possible to focus learning only on the more important states by ignoring the others. However, the extent to which such an a priori pruning may lead to overfitting needs to be explored. Another area for future research is an investigation of the extent to which this approach depends on the cost-sensitive classifier. In particular, it would be interesting to investigate the benefits of applying modern cost-sensitive classification techniques (e.g., cost-proportionate example weighting [13] and boosting [3]) in focused learning.
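As one possible instantiation of that last point, cost-proportionate example weighting in the spirit of [13] can reduce the cost-sensitive problem to ordinary classification, for example by rejection sampling each training example in proportion to its importance. The sketch below is a hedged illustration of that reduction applied to T_CS; the function name and the choice of rejection sampling (rather than direct example weighting) are assumptions made here for brevity, not prescribed by the paper.

```python
import random
from typing import Hashable, List, Optional, Tuple

State = Hashable
Action = int


def cost_proportionate_rejection_sample(
    t_cs: List[Tuple[State, Action, float]],
    rng: Optional[random.Random] = None,
) -> List[Tuple[State, Action]]:
    """Keep each cost-sensitive example <s, a*(s), G*(s)> with probability
    G*(s) / Z, where Z = max_s G*(s). A standard (cost-insensitive) classifier
    trained on the accepted examples then approximately minimizes the
    importance-weighted error, concentrating learning on important states."""
    rng = rng or random.Random(0)
    z = max((g for _, _, g in t_cs), default=0.0)
    if z <= 0.0:  # all importances are zero: either action is equally good everywhere
        return []
    return [(s, a) for s, a, g in t_cs if rng.random() < g / z]
```

The accepted sample could then be passed to any off-the-shelf classifier as an approximation of the cost-sensitive objective for π̂*_CS.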

Similar Articles

Reinforcement Learning with Raw Image Pixels as Input State

We report in this paper some positive simulation results obtained when image pixels are directly used as input state of a reinforcement learning algorithm. The reinforcement learning algorithm chosen to carry out the simulation is a batch-mode algorithm known as fitted Q iteration.

Batch mode reinforcement learning based on the synthesis of artificial trajectories

In this paper, we consider the batch mode reinforcement learning setting, where the central problem is to learn from a sample of trajectories a policy that satisfies or optimizes a performance criterion. We focus on the continuous state space case for which usual resolution schemes rely on function approximators either to represent the underlying control problem or to represent its value functi...

Reinforcement learning for robot soccer

Batch reinforcement learning methods provide a powerful framework for learning efficiently and effectively in autonomous robots. The paper reviews some recent work of the authors aiming at the successful application of reinforcement learning in a challenging and complex domain. It discusses several variants of the general batch learning framework, particularly tailored to the use of multilayer ...

Explanation-based Learning and Reinforcement Learning: a Unified View

In speedup-learning problems, where full descriptions of operators are always known, both explanation-based learning (EBL) and reinforcement learning (RL) can be applied. This paper shows that both methods involve fundamentally the same process of propagating information backward from the goal toward the starting state. RL performs this propagation on a state-by-state basis, while EBL computes ...

Batch Reinforcement Learning

Batch reinforcement learning is a subfield of dynamic programming-based reinforcement learning. Originally defined as the task of learning the best possible policy from a fixed set of a priori-known transition samples, the (batch) algorithms developed in this field can be easily adapted to the classical online case, where the agent interacts with the environment while learning. Due to the effic...

Publication date: 2004